debakarr
GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 10 - Model Selection And Boosting/XGBoost/[R] XGBoost.ipynb
Kernel: R

XGBoost

Data preprocessing

# Importing the dataset
dataset = read.csv('Churn_Modelling.csv')
head(dataset, 5)
# Keeping the relevant columns, then dropping the Gender column
dataset = dataset[4:14]
dataset = dataset[-c(3)]
head(dataset, 5)
# Encoding the categorical variables as factors
dataset$Geography = as.numeric(factor(dataset$Geography,
                                      levels = c('France', 'Spain', 'Germany'),
                                      labels = c(1, 2, 3)))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(1234)
split = sample.split(dataset$Exited, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Fitting XGBoost to the Training set

# install.packages('xgboost')
library(xgboost)
classifier = xgboost(data = as.matrix(training_set[-10]), label = training_set$Exited, nrounds = 10)
[1]	train-rmse:0.417976
[2]	train-rmse:0.369643
[3]	train-rmse:0.341009
[4]	train-rmse:0.325538
[5]	train-rmse:0.316370
[6]	train-rmse:0.309533
[7]	train-rmse:0.306012
[8]	train-rmse:0.302703
[9]	train-rmse:0.300868
[10]	train-rmse:0.298456

It's reasonable to treat 0.30 as the convergence point. Note, however, that train-rmse (root mean squared error on the training set) keeps decreasing if we keep increasing nrounds. I tried it myself, but the result is too long to display here; try it by running the code below (which uses cross-validation to assess model performance):

set.seed(1234)
cv = xgb.cv(data = as.matrix(training_set[-10]),
            label = training_set$Exited,
            nfold = 5,
            nrounds = 500)
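Rather than fixing nrounds at 500, xgb.cv can also stop on its own once the evaluation metric has not improved for a given number of rounds. A minimal sketch (the early_stopping_rounds value here is an illustrative choice, not from the notebook):

```r
set.seed(1234)
cv_es = xgb.cv(data = as.matrix(training_set[-10]),
               label = training_set$Exited,
               nfold = 5,
               nrounds = 500,
               early_stopping_rounds = 20)  # stop if no improvement for 20 consecutive rounds
cv_es$best_iteration  # the round at which the evaluation metric was best
```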

Also, if you don't want the train-rmse log to be printed, pass verbose = 0 as a parameter.
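Incidentally, train-rmse appears because xgboost's default objective is regression; since Exited is a binary label, a classification objective may be a better fit. A sketch combining that with verbose = 0 (the objective choice is my suggestion, not from the notebook):

```r
classifier = xgboost(data = as.matrix(training_set[-10]),
                     label = training_set$Exited,
                     nrounds = 10,
                     objective = "binary:logistic",  # binary classification instead of the default regression
                     verbose = 0)                    # suppress the per-round training log
```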


Applying k-Fold Cross Validation

library(caret)
Loading required package: lattice
Loading required package: ggplot2
folds = createFolds(training_set$Exited, k = 10)
cv = lapply(folds, function(x) {
  training_fold = training_set[-x, ]
  test_fold = training_set[x, ]
  # Train on the current training fold (not the full training set)
  classifier = xgboost(data = as.matrix(training_fold[-10]),
                       label = training_fold$Exited,
                       nrounds = 10)
  y_fold_pred = predict(classifier, newdata = as.matrix(test_fold[-10]))
  y_fold_pred = (y_fold_pred > 0.5)
  cm = table(test_fold[, 10], y_fold_pred)
  accuracy = (cm[1, 1] + cm[2, 2]) / sum(cm)
  return(accuracy)
})
[1]	train-rmse:0.417976
[2]	train-rmse:0.369643
[3]	train-rmse:0.341009
[4]	train-rmse:0.325538
[5]	train-rmse:0.316370
[6]	train-rmse:0.309533
[7]	train-rmse:0.306012
[8]	train-rmse:0.302703
[9]	train-rmse:0.300868
[10]	train-rmse:0.298456
(a similar ten-round log repeats for each of the 10 folds)
mean(as.numeric(cv))  # mean accuracy across the 10 folds
sd(as.numeric(cv))    # standard deviation across the 10 folds
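The cross-validated accuracy above is computed on folds of the training set only; as a final check, the held-out test set from the earlier split can be evaluated the same way. A sketch reusing the classifier fitted in the earlier cell:

```r
# Predicting on the held-out test set with the classifier trained earlier
y_pred = predict(classifier, newdata = as.matrix(test_set[-10]))
y_pred = (y_pred > 0.5)

# Confusion matrix and accuracy on the test set
cm = table(test_set[, 10], y_pred)
cm
(cm[1, 1] + cm[2, 2]) / sum(cm)
```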